Abstract

Background: The TNM staging system is based on three anatomic prognostic factors: Tumor, Lymph Node and 
Metastasis. However, cancer is no longer considered an anatomic disease. Therefore, the TNM should be expanded 
to accommodate new prognostic factors in order to increase the accuracy of estimating cancer patient outcome. 
The ensemble algorithm for clustering cancer data (EACCD) by Chen et al. reflects an effort to expand the TNM 
without changing its basic definitions. Though results on using EACCD have been reported, there has been no 
study on the analysis of the algorithm. In this report, we examine various aspects of EACCD using a large breast 
cancer patient dataset. We compared the output of EACCD with the corresponding survival curves, investigated 
the effect of different settings in EACCD, and compared EACCD with alternative clustering approaches. 

Results: Using the basic T and N definitions, EACCD generated a dendrogram that shows a graphic relationship 
among the survival curves of the breast cancer patients. The dendrograms from EACCD are robust for large values 
of m (the number of runs in the learning step). When m is large, the dendrograms depend on the linkage 
functions. 

The statistical tests employed in the learning step, however, have minimal effect on the dendrogram for large m. 
In addition, if the step for learning dissimilarity is omitted from EACCD, the resulting approaches can have degraded 
performance. Furthermore, clustering based only on prognostic factors could generate misleading dendrograms, 
and direct use of partitioning techniques could lead to misleading assignments to clusters. 

Conclusions: When only the Partitioning Around Medoids (PAM) algorithm is involved in the step of learning 
dissimilarity, large values of m are required to obtain robust dendrograms, and for a large m EACCD can effectively 
cluster cancer patient data. 



Background 

Accurate outcome (survival) estimation is often the key 
in the successful treatment of cancer patients. Estima- 
tion depends on clinical or laboratory variables or fac- 
tors that are linked to patient outcome. Found in all 
specialties of medicine, predictive factors take on signifi- 
cant clinical meaning when treatment options are avail- 
able, but they become more important if treatment 
options are limited and not always effective. 

Currently, the most common predictive factors in cancer 
medicine are the three variables T, N, and M of the 
TNM (Tumor, Lymph Node, and Metastasis) staging system, 
which define the anatomic extent of disease [1]. The 
"T" usually refers to the size of the primary tumor, "N" 
refers to the presence or absence of metastatic deposits 
in regional lymph nodes, and "M" indicates the presence 
of metastatic disease. With the TNM staging system, 
levels of these three variables are combined, and patients 
are classified into four stage groups according to different 
combinations of the levels. Then the outcome estimation 
of patients is based on the survival function estimated for 
each stage. 

The TNM was created by surgeons primarily for sur- 
gery. However, cancer medicine no longer lives in the 
world where surgery remains the only treatment. The field 
of cancer is now characterized by screening and early 
detection, proteogenomics, multiple therapies, and a 
bewildering array of prognostic factors. Advances in mole- 
cular medicine, imaging, and therapeutics are now forcing 
us to integrate additional prognostic factors for more 
accurate estimation of patient outcome [2-5]. Therefore, 
to improve the estimation of outcome, methods are 
needed to incorporate additional prognostic factors into 
the TNM without changing the anatomic definitions. 

The ensemble algorithm for clustering cancer data 
(EACCD) by Chen et al. [6] is designed to explore expan- 
sion of the TNM by integrating additional factors into 
the system. Though many results on using EACCD have 
been reported, there has been no study available to ana- 
lyze the algorithm. In this report, we present an analysis 
of EACCD by using a large breast cancer dataset. We 
compared the output of EACCD with the corresponding 
survival curves, investigated the effect of different settings 
for EACCD, and compared EACCD with several other 
clustering approaches. This report represents an exten- 
sive expansion of the work in [7]. 

Method 

EACCD 

In this section, we describe the EACCD. Our presentation 
allows a collection of partition methods in constructing 
dissimilarities and thus is more general than that in [6]. 
Let the record for the ith patient be 
$(x_{i0}, x_{i1}, \ldots, x_{ip}, \delta_i)$, where $x_{i0}$ equals the 
observed time (censored or uncensored survival time), 
$x_{ij}$ are measurements on variables (factors) $X_j$ for 
$j = 1, \ldots, p$, and $\delta_i$ is the event indicator, 
which is defined to be 1 if the event (e.g., death) 
has occurred and 0 if the time on study is right-censored. 
Define a combination to be a set of records 
$(x_{i0}, x_{i1}, \ldots, x_{ip}, \delta_i)$ that 
corresponds to one level of each variable (a continuous 
variable should be discretized). EACCD is an algorithm 
used to cluster combinations. In the algorithm, the 
dissimilarity between two combinations is learnt by repeatedly 
using some clustering (partitioning) approaches based on 
criterion minimization, and then the learnt dissimilarity 
measure is used with a hierarchical clustering method in 
order to find final clusters of combinations. The algorithm 
involves the following three steps. 

Computing initial dissimilarity 

Assume that there are a total of n combinations $x_1, x_2, \ldots, x_n$. 
Then the following initial dissimilarity measure 
$dis_0(x_i, x_{i'})$ is defined between two combinations $x_i$ and $x_{i'}$: 

$$dis_0(x_i, x_{i'}) = d_0. \qquad (1)$$

Here $d_0$ is the value of a test statistic (e.g., the log-rank 
test statistic [8]) used to determine if there is a difference 
in the survival functions between the two populations 
associated with $x_i$ and $x_{i'}$. In general, $dis_0(x_i, x_{i'})$ assumes 
any non-negative value. 
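
As an illustration of the kind of statistic that can play the role of $d_0$, the following is a minimal pure-Python sketch of the standard two-sample log-rank chi-square statistic. It is not the routine used in the evaluation (which was carried out in "R"); the function name and argument layout are assumptions made here for illustration.

```python
from collections import Counter


def logrank_statistic(times_a, events_a, times_b, events_b):
    """Two-sample log-rank chi-square statistic (hypothetical helper).

    times_*  : observed times (event or censoring) for each group
    events_* : 1 if the event occurred, 0 if right-censored
    """
    deaths_a = Counter(t for t, e in zip(times_a, events_a) if e == 1)
    deaths_b = Counter(t for t, e in zip(times_b, events_b) if e == 1)
    event_times = sorted(set(deaths_a) | set(deaths_b))

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t in each group.
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        n = n_a + n_b
        d_a = deaths_a.get(t, 0)
        d = d_a + deaths_b.get(t, 0)
        # Observed events in group A minus their expectation under the null.
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0
    return observed_minus_expected ** 2 / variance
```

Applying such a function to every pair of combinations would fill an n x n matrix of non-negative initial dissimilarities, with larger values indicating more clearly separated survival experiences.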

Computing learnt dissimilarity 

Let C denote a cluster assignment, assigning the ith combination 
to a cluster, i.e., $C(x_i) \in \{1, 2, \ldots, K\}$ for a 
predetermined integer K. The optimal assignment $C^*$ is 
obtained by minimizing the "within-cluster" scatter, i.e., by 
solving the following discrete optimization problem: 

$$\min_{C, \{i_k\}_1^K} \sum_{k=1}^{K} \sum_{C(x_i) = k} dis_0(x_i, x_{i_k}). \qquad (2)$$

Numerical procedures (e.g., the Partitioning Around 
Medoids (PAM) [9]) are employed to find the solution to 
the above optimization problem. For the data $\{x_1, x_2, \ldots, x_n\}$, 
one K and one clustering or partitioning method may 
be chosen to partition the data into K clusters. However, 
the final assignment usually depends on the selected 
method and the initial reallocation. To overcome this, one 
can run this partition process m times. Each time a number 
K is randomly picked from a given interval $[K_1, K_2]$ and 
a partitioning procedure is also randomly selected. Define 
$\delta_l(i, j) = 1$ if the lth run of a procedure does not assign $x_i$ 
and $x_j$ into the same cluster, and $\delta_l(i, j) = 0$ otherwise. 
Then define the following dissimilarity measure between 
two combinations $x_i$ and $x_j$: 

$$dis(x_i, x_j) = \frac{\sum_{l=1}^{m} \delta_l(i, j)}{m}. \qquad (3)$$

Note that $dis(x_i, x_j)$ ranges from 0 to 1. A smaller value 
of $dis(x_i, x_j)$ indicates that $x_i$ and $x_j$ most likely come from 
the same "hidden" group. In other words, a smaller 
dissimilarity $dis(x_i, x_j)$ is expected to imply a smaller difference 
between the two survival functions associated with the 
two combinations. 
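
The learning step can be sketched as follows. This is a minimal illustration rather than the implementation evaluated in this report: `k_medoids` is a simplified alternating k-medoids routine used here as a stand-in for PAM, and all function and parameter names are assumptions. The sketch requires `k_max` to be no larger than the number of items.

```python
import random


def k_medoids(d, k, rng, max_iter=50):
    """Simplified k-medoids (assign/update loop), a stand-in for PAM."""
    n = len(d)
    medoids = rng.sample(range(n), k)  # random initial medoids
    for _ in range(max_iter):
        # Assign every item to its nearest medoid.
        labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep medoid of an empty cluster
                continue
            # The new medoid minimizes total dissimilarity to its cluster members.
            new_medoids.append(min(members, key=lambda j: sum(d[i][j] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def learnt_dissimilarity(d0, m, k_min, k_max, seed=0):
    """Fraction of m runs in which items i and j land in different clusters."""
    rng = random.Random(seed)
    n = len(d0)
    split_counts = [[0] * n for _ in range(n)]
    for _ in range(m):
        k = rng.randint(k_min, k_max)  # random number of clusters for this run
        labels = k_medoids(d0, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] != labels[j]:
                    split_counts[i][j] += 1
    return [[count / m for count in row] for row in split_counts]
```

Pairs that are almost never separated receive values near 0, and pairs that are separated in every run receive the value 1, which matches the range noted for equation (3).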

Hierarchical clustering 

This step clusters the combinations by applying a linkage 
method [10] and the learnt dissimilarity $dis(x_i, x_j)$. The 
primary output of EACCD is a dendrogram that provides a 
summary of the survival experiences based on the levels of 
prognostic factors, and thus has multiple applications. 

The algorithm is outlined in Algorithm 1. Note that if 
only PAM is used for computing the learnt dissimilarity, 
then the algorithm reduces to that introduced in [6]. 
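
For readers who want to see the third step concretely, here is a minimal sketch of agglomerative clustering with the average linkage on a precomputed dissimilarity matrix. It is a plain-Python illustration with assumed names, not the routine used in the evaluation, and it returns the sequence of merges that a dendrogram would draw.

```python
def average_linkage(dis):
    """Agglomerative clustering with average linkage (hypothetical helper).

    dis : symmetric matrix (list of lists) of pairwise dissimilarities.
    Returns a list of merges (members_a, members_b, height) in merge order.
    """
    clusters = [[i] for i in range(len(dis))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise dissimilarities between the two clusters.
                total = sum(dis[i][j] for i in clusters[a] for j in clusters[b])
                height = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges
```

Replacing the average in the inner loop with the maximum or the minimum of the pairwise dissimilarities would give the complete or the single linkage, respectively.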

Data set 

A breast cancer patient dataset was obtained from the Sur- 
veillance, Epidemiology, and End Results (SEER) Program 
of the National Cancer Institute [11]. Because of its size, 
quality control, broad US representation, unbiased 
ascertainment, and 35-year history, the Program is ideal for 
evaluating algorithms. We selected data for breast cancer 
from the years 1990-2000 using SEER's Case Listing. During 
the selection process, we followed the definitions for 
tumor size and number of involved lymph nodes as published 
by the American Joint Committee on Cancer [1]. 
The dataset contained 202,219 cases having complete 
records on T (tumor size), N (nodal status), X (survival 
time), and $\delta$ (censoring status). The factors T and N have 
3 and 4 categories, respectively, as listed in Table 1. Therefore, 
there are 12 (3 x 4) combinations based on T and N. 
For convenience, we denote by T1N0 the combination 
formed using categories T1 and N0, by T1N1 the 
combination formed using categories T1 and N1, and so 
on. 

Algorithm 1 Ensemble algorithm for clustering cancer 
patient data 

1. Define the initial dissimilarity $dis_0$ in (1). 

2. Obtain a collection of procedures for solving (2). 
Choose m, $K_1$, and $K_2$, and run these procedures m 
times, where for each time, a procedure is randomly 
selected from the collection and a K is randomly chosen 
from the interval $[K_1, K_2]$. Then construct the pairwise 
dissimilarity measure dis by using equation (3). 

3. Cluster the combinations by applying a linkage 
method and the learnt measure dis. 



Evaluation of EACCD 

We evaluated EACCD by performing a series of experi- 
ments using the programming language "R" [12]. The 
PAM algorithm was used in the second step of EACCD 
throughout the evaluation. Random medoids were initially 
selected for the PAM in all cases except for $A_4$, described 
below, where the default initial medoids in "R" were used. 

The evaluation began with the application of the 
algorithm to clustering the breast cancer patients. We 
examined how the algorithm grouped the patients and 
compared this grouping with the possible grouping pattern 
exhibited in the survival curve plot. For the experiments, 
the log-rank test statistic [8] was used to determine the 
initial dissimilarity in the first step of the algorithm. In the 
second step we chose $K_1 = 2$ and $K_2 = 11$ (the total number of 
combinations minus one). The PAM algorithm was 
executed m = 10000 times. In the third step, 
the average linkage hierarchical clustering technique [10] 
was used. 

Table 1 Definitions of T and N for SEER breast cancer cases from 1990-2000. 

Prognostic factor   Category                                 Level 
Tumor size          T1 (T ≤ 2 cm)                            1 
                    T2 (2 cm < T ≤ 5 cm)                     2 
                    T3 (T > 5 cm)                            3 
Nodal status        N0 (no positive axillary nodes)          1 
                    N1 (1 - 3 nodes contain tumor)           2 
                    N2 (4 - 10 nodes contain tumor)          3 
                    N3 (more than 10 nodes contain tumor)    4 

We then examined the effect of different settings in 
EACCD on the dendrogram generated by the algorithm. 
There were mainly three "factors" that could influence the 
final result in EACCD: test (the statistical test employed in 
determining the initial dissimilarity in Step 1 of the algo- 
rithm), m (the number of rounds of partitioning proce- 
dures performed in obtaining the learnt dissimilarity in 
Step 2) and the linkage function (the linkage function used 
in the hierarchical clustering procedure in Step 3). The 
effects of these "factors" were analyzed by varying their 
"values." While the value of m was chosen from {10, 20, 
50, 100, 500, 1000, 5000, 10000, 20000, 30000}, we consid- 
ered three tests (the log-rank test, the Gehan-Wilcoxon's 
test, and the Tarone and Ware's test [8]) and three linkage 
functions (the average linkage, the complete linkage, and 
the single linkage [10]). 

Finally, we compared EACCD with four additional 
approaches that could be used to cluster the cancer 
patient data. These approaches were either straightforward 
or modifications of EACCD. Specifically, the four 
approaches $A_1$, $A_2$, $A_3$, and $A_4$ are described below. For 
demonstration, we used m = 10000, the log-rank test, and the 
average linkage for the setting of EACCD. 
Approach $A_1$ 

This was tailored from the EACCD, omitting the learning 
step for dissimilarity. The initial dissimilarity measure 
$dis_0$ in (1) was obtained first using the log-rank 
test and then standardized into [0, 1] by the equation 
$dis^{s}_{A_1} = dis_0 / \max\{dis_0\}$. The standardized initial 
dissimilarity values were then used in the hierarchical 
clustering procedure with the average linkage function. 
Approach $A_2$ 

In testing the differences between two survival curves 
associated with two combinations, a smaller p-value 
normally indicates a larger difference between the survival 
curves. Therefore, 1 - p, ranging from 0 to 1, could be 
used as the pairwise dissimilarity measure between two 
combinations in light of the survival. In the approach 
$A_2$, this dissimilarity 1 - p, from the log-rank test, was 
directly used in the hierarchical clustering procedure with 
the average linkage function. The learning step for 
dissimilarity was not required. 
Approach $A_3$ 

In $A_3$, we considered one traditional procedure in clustering 
the cancer data by using the two factors T and N. For 
each combination, let $\bar{T}$ denote the average value of T and 
$\bar{N}$ the average value of N. We could use $\bar{T}$ and $\bar{N}$ to 
represent the T and N value of the combination, respectively. 
Since $\bar{T}$ has a much larger range than $\bar{N}$, a linear 
transformation was performed to standardize $\bar{T}$ and $\bar{N}$ into [0, 1] as 
$\bar{T}^{s} = (\bar{T} - \min\{\bar{T}\}) / (\max\{\bar{T}\} - \min\{\bar{T}\})$ and 
$\bar{N}^{s} = (\bar{N} - \min\{\bar{N}\}) / (\max\{\bar{N}\} - \min\{\bar{N}\})$. 
Let $\bar{T}^{s}_i$ and $\bar{N}^{s}_i$ be the standardized 
values for combination $x_i$. Then the dissimilarity 
between combinations $x_i$ and $x_j$ was defined as 
$dis(x_i, x_j) = |\bar{T}^{s}_i - \bar{T}^{s}_j| + |\bar{N}^{s}_i - \bar{N}^{s}_j|$. This dissimilarity dis 
was then standardized into the range [0, 1] using 
$dis^{s}_{A_3} = dis / \max\{dis\}$. Based on $dis^{s}_{A_3}$, hierarchical 
clustering with the average linkage was then performed. 
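
The construction of the factor-only dissimilarity can be sketched in a few lines. This is an illustration under the description above, with hypothetical function names and made-up input values; it assumes that the averages are not all equal, so that the ranges are non-zero.

```python
def min_max(values):
    """Linearly rescale a list of numbers to [0, 1] (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def factor_dissimilarity(mean_t, mean_n):
    """Pairwise dissimilarity from standardized average T and N values.

    mean_t, mean_n : per-combination averages of T and N (hypothetical inputs).
    Returns a matrix rescaled so that its largest entry equals 1.
    """
    ts, ns = min_max(mean_t), min_max(mean_n)
    n = len(ts)
    # Sum of absolute differences of the standardized averages.
    raw = [[abs(ts[i] - ts[j]) + abs(ns[i] - ns[j]) for j in range(n)]
           for i in range(n)]
    largest = max(max(row) for row in raw)
    return [[v / largest for v in row] for row in raw]
```

Note that survival times never enter this computation, so the resulting matrix reflects only how far apart two combinations are in tumor size and nodal involvement.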
Approach $A_4$ 

In $A_4$, the PAM clustering algorithm was directly used to 
partition the cancer data. The quantity $dis^{s}_{A_1}$ in the 
approach $A_1$ was taken as the input dissimilarity 
measurement. The number of clusters was set at 2, ..., 11, 
respectively, and thus 10 partition results were available. 

Results and discussion 

An application study 

EACCD, when applied to the breast cancer data, gener- 
ated a dendrogram (Figure 1(a)) that exhibits one rela- 
tionship among 12 survival curves corresponding to the 
12 combinations. 



More specifically, the dendrogram provides an overall 
view of the relationship among the outcomes as the levels 
of prognostic factors change. We begin with the 
left branch of Figure 1(a). The dissimilarity 
(difference) between the survival curve of T1N3 and the 
survival curve of T3N2 is 0.20. Merge T1N3 with T3N2 
and denote by T1N3 + T3N2 the resulting group of 
patients. Then the difference between the survival curve 
of T1N3 + T3N2 and the survival curve of T2N3 is 0.41. 
Merge T1N3 + T3N2 with T2N3 and denote the resulting 
group of patients by T1N3 + T3N2 + T2N3. Then, in light 
of survival, this group T1N3 + T3N2 + T2N3 differs from 
T3N3 by a value of 0.67. Merge T3N3 with T1N3 + 
T3N2 + T2N3 and denote the resulting group by T1N3 
+ T3N2 + T2N3 + T3N3. Then T2N2 + T3N1 differs from 
T1N3 + T3N2 + T2N3 + T3N3 by a value of 0.70 in 
terms of survival. Here T2N2 + T3N1 is the group from 
merging T2N2 with T3N1, where T2N2 differs from 
T3N1 by a value of 0.00. Denote by T1N3 + T3N2 + 
T2N3 + T3N3 + T2N2 + T3N1 the result from merging 
T2N2 + T3N1 and T1N3 + T3N2 + T2N3 + T3N3. The 
above shows the relationship among the survival curves 
of the combinations contained in the left branch of the 
dendrogram. A similar interpretation applies to the survival 
curves of the combinations in the right branch of the 
dendrogram. Finally, the left branch differs from the right 
branch by a value of 1.0 in light of survival. That is, 1.0 is 
the difference between the survival curve of the group 
T1N1 + T2N0 + T3N0 + T1N2 + T2N1 + T1N0 and the 
survival curve of the group T1N3 + T3N2 + T2N3 + 
T3N3 + T2N2 + T3N1. 

Figure 1 Dendrogram of T and N from EACCD and survival curves for T and N combinations. (a) Dendrogram of T and N based on log-rank test, average linkage function, and m = 10000. (b) Kaplan-Meier survival curves for twelve combinations of T and N (horizontal axis: survival time in months). 

The relationship among the survival curves exhibited in 
the dendrogram of T and N (Figure 1(a)) can be confirmed 
by visually checking the 12 survival curves shown 
in Figure 1(b). These survival curves were constructed by 
the Kaplan-Meier procedure [8]. The survival curves in 
Figure 1(b) can be divided into two groups, group 1 
consisting of the lower six curves and group 2 consisting of 
the upper six curves. The curves in group 1 and group 2 
appear on the left and right branches of the dendrogram 
in Figure 1(a), respectively. Thus, from a practical 
perspective, the dendrogram initially divides the patients 
into those with a favorable outcome and those with an 
unfavorable outcome. A visual check of group 1 in Figure 
1(b) shows certain differences among the curves. For 
instance, the two closest curves are the curve of T2N2 and 
the curve of T3N1, and the next two closest curves are the 
curves of T1N3 and T3N2. If we merge combinations in 
the order of increasing differences between survival rates, 
we would first merge T2N2 with T3N1, and then merge 
T1N3 with T3N2, merge T1N3 + T3N2 with T2N3, merge 
T1N3 + T3N2 + T2N3 with T3N3, and finally, merge 
T1N3 + T3N2 + T2N3 + T3N3 with T2N2 + T3N1. 
Clearly, this observation coincides with the relationship 
among survival curves depicted by the left branch of the 
dendrogram in Figure 1(a). Similarly, the right branch of 
the dendrogram captures the survival differences and the 
order of merging of the six curves in group 2. 
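
For reference, the Kaplan-Meier procedure behind survival curves such as those in Figure 1(b) can be sketched as below. This is a minimal illustration with an assumed function name and toy inputs, not the code used to draw the figure.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate (hypothetical helper).

    times  : observed times (event or censoring)
    events : 1 if the event occurred, 0 if right-censored
    Returns [(t, S(t))] at each distinct event time, in increasing order.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        # Number of subjects still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        # Multiply by the conditional probability of surviving past t.
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Computing one such curve per combination and plotting them together gives the kind of picture against which the dendrogram is checked visually.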

Effect of settings on EACCD 
Effect of m 

The learnt dissimilarity dis in EACCD depends on the 
value of m and converges when m is sufficiently 
large. If, on the other hand, m is small, the dissimilarity 
has not converged and varies from run to run. 
Thus, the resulting dendrograms will not be 
robust. Specifically, for a small value of m, multiple runs 
of EACCD with the same test and same linkage may 
produce significantly different dendrograms. This is shown in 
Figures 2(a) and 2(b). However, when m is large, the 
dendrograms for the same test and same linkage are virtually 
the same. For example, when m = 10000, 20000, 30000, 
the dendrograms (Figures 3(d), (e), (f)) based on the 
Gehan-Wilcoxon's test and the complete linkage are 
similar, and the dendrograms (Figures 3(g), (h), (i)) based on 
the Tarone-Ware's test and the single linkage are almost 
identical. Therefore, a large m should be used when 
applying EACCD. 

Figure 2 Dendrograms from the log-rank test, the average linkage, and small m. (a) m = 10. (b) m = 10. 

Figure 3 Dendrograms from large m values. (a) Log-rank, average, m = 10000. (b) Log-rank, average, m = 20000. (c) Log-rank, average, m = 30000. (d) Gehan-Wilcoxon, complete, m = 10000. (e) Gehan-Wilcoxon, complete, m = 20000. (f) Gehan-Wilcoxon, complete, m = 30000. (g) Tarone-Ware, single, m = 10000. (h) Tarone-Ware, single, m = 20000. (i) Tarone-Ware, single, m = 30000. 

Effect of tests and linkage functions 

We further examined the effect of statistical tests for 
large values of m. Figure 4 lists nine dendrograms for 
m = 10000, the log-rank test, the Gehan-Wilcoxon's test, 
the Tarone and Ware's test, the average linkage, the 
complete linkage, and the single linkage. There were two 
observations, drawn by viewing the figure horizontally 
and vertically. First, for a given test, the dendrograms 
based on different linkage functions exhibit the same 
merging pattern, but merging or fusion can occur at 
significantly different dissimilarity values. For example, 
with the log-rank test, the dendrogram from the average 
linkage has the same shape and merging pattern as the 
dendrogram from the complete linkage. For the average 
linkage, T2N2 + T3N1 is merged with T1N3 + T3N2 + 
T2N3 + T3N3 at the dissimilarity of 0.76. But that fusion 
occurs at the dissimilarity of 0.79 for the complete 
linkage. Second, for a given linkage, the dendrograms derived 
from different tests are virtually the same, which 
indicates that for a given linkage, test statistics have minimal 
influence on the dendrogram. For instance, Figures 4(a), 
(d), and 4(g) essentially show the same dendrogram for 
the average linkage and three tests (the log-rank test, the 
Gehan-Wilcoxon's test, and the Tarone and Ware's test). 

Figure 4 Dendrograms from m = 1000, three tests, and three linkage functions. (f) Gehan-Wilcoxon, single. (h) Tarone-Ware, complete. (i) Tarone-Ware, single. 

In summary, our experiments have shown that a large 
m (such as m ≥ 10000) should be used in EACCD. For a 
large m, different linkage functions can generate different 
dendrograms. But different statistical tests have minimal 
or no influence on the dendrogram. 
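
One informal way to see why a large m stabilizes the learnt dissimilarity is the following back-of-the-envelope bound. It rests on an assumption not stated in the report, namely that the m partition runs are independent and identically distributed, so that the learnt dissimilarity in (3) is a sample mean of m indicator variables. The symbol $p_{ij}$, the probability that a single run separates $x_i$ and $x_j$, is introduced here for illustration only.

```latex
\mathrm{Var}\bigl[dis(x_i, x_j)\bigr] = \frac{p_{ij}\,(1 - p_{ij})}{m} \le \frac{1}{4m},
\qquad
\mathrm{sd}\bigl[dis(x_i, x_j)\bigr] \le \frac{1}{2\sqrt{m}}
```

Under this assumption, the run-to-run standard deviation of any entry is at most about 0.16 for m = 10 but at most 0.005 for m = 10000, which is in line with the unstable dendrograms seen for small m and the nearly identical ones seen for large m.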

Comparisons with alternative approaches 
Approach $A_1$ 

For approach $A_1$, a hierarchical clustering procedure 
with the average linkage was applied directly to the 
breast cancer data. The dissimilarity was determined by 
the value of the log-rank test statistic. The dendrogram 
is shown in Figure 5(a). It indicates that T1N0 becomes 
a separate group, for the following reason. 
Consider the set S containing all the dissimilarities 
between one survival function and its "nearest" neighbor, 
which is identified visually from Figure 1(b). 
Computation shows that the dissimilarity between T1N0 and 
its nearest neighbor T1N1 is the maximum of S and is 
nearly 12 times larger than the second largest value in S. 
According to the construction of the dendrogram, T1N0 
is therefore merged with the group of all the other eleven 
combinations at the last step in the hierarchical clustering 
procedure. 

Note that the combination T1N0 contains significantly 
more patients than any other combination (Figure 1(b)). 
Other experiments showed that if the number of patients 
in T1N0 was reduced to a quantity comparable with the 
number of patients in other combinations, dendrograms 
from the approach $A_1$ would have the same shape and 
merging pattern as in Figure 1(a). This suggests that $A_1$ is 
sensitive to the relative size of the combinations. 
Approach $A_2$ 

The approach $A_2$ also used a hierarchical clustering 
procedure with the average linkage to directly cluster the breast 
cancer data. But in this approach, the dissimilarity was 
obtained by the p-value from the log-rank test. The den- 
drogram, shown in Figure 5(b), indicates that the merging 
steps on the top are not obvious for several combinations. 
The reason is simply that the dissimilarity 1 - p is 1 for 
most pairs of combinations, due to the rounding effect in 
computation. 
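
The rounding effect can be reproduced with a short sketch. It assumes that the log-rank statistic is referred to a chi-square distribution with one degree of freedom, as is standard for a two-group comparison; the function name is hypothetical.

```python
import math


def one_minus_p(chi_square_stat):
    """1 - p for a chi-square statistic with one degree of freedom.

    For one degree of freedom, the upper-tail probability of the
    chi-square distribution equals erfc(sqrt(x / 2)).
    """
    p_value = math.erfc(math.sqrt(chi_square_stat / 2.0))
    return 1.0 - p_value
```

With double-precision arithmetic, `one_minus_p(200.0)` and `one_minus_p(500.0)` both evaluate to exactly 1.0, because the p-values are far below machine precision relative to 1. With groups as large as those in this dataset, many pairwise statistics can be expected to be large, so distinct pairs become indistinguishable under this dissimilarity.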
Approach $A_3$ 

We employed $A_3$ to cluster the data by using only T and 
N. Survival times were not used with this approach. The 
corresponding dendrogram is shown in Figure 5(c). 
Comparing Figure 5(c) with the survival curve plot in Figure 
1(b), we can observe that the merging pattern described in 
the dendrogram at low levels of dissimilarity does not 
seem reasonable. For instance, the dendrogram indicates 
that T2N3 and T1N3 merge first and then they merge 
with T3N3 to form a group without T3N2, which is not 
reasonable in light of Figure 1(b). Therefore, the traditional 
clustering procedure using T and N does not work here. 
The reason might be that T and N together could not 
capture the main information regarding the survival of cancer 
patients. 

The approach $A_3$ can be modified by incorporating the 
learning step, as in EACCD. One modification, denoted by 
A3, is obtained by replacing $dis_0$ in the first step of 
EACCD by $dis^{s}_{A_3}$ and then following steps 2 and 3 in 
EACCD with the average linkage. Figure 5(d) shows the 
dendrogram (m = 10000), which again presents unreason- 
able grouping assignments. 
Approach $A_4$ 

We ran the PAM algorithm to directly partition the breast 
cancer data (combinations) with the number of clusters set 
at each of the following values: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. 
We also obtained the corresponding partitions from EACCD 
by cutting the dendrogram in Figure 1(a). Comparisons showed 
that the results from the PAM and EACCD were the same 
except for the case where the number of clusters was 4. 
Table 2 lists the partition results for four clusters from 
both methods, where a higher group number means a 
smaller survival in the group. Comparing the table with 
Figure 1(b), we see that the four clusters from EACCD are 
reasonable. However, groups 2 and 3 from the PAM show 
a separation of T2N1 from T1N2, which should be placed 
into the same group as indicated by the survival plot 
(Figure 1(b)). Therefore, the partition of the data from 
EACCD is more consistent with the survival curves than 
that from the PAM. 

In summary, the results of these comparisons have 
shown that 1) if the step for learning dissimilarity is 
omitted in EACCD, then the resulting approaches can 
have a degraded performance, 2) if survival times are 
not taken into account, then clustering based on 
prognostic factors will likely generate misleading den- 
drograms, and 3) direct applications of partitioning 
techniques to the data can lead to misleading assign- 
ments to clusters. 

Conclusion 

This report presents a three-pronged analysis of EACCD 
based on a breast cancer patient dataset. First, we 
examined whether grouping patients by EACCD was 
consistent with the "natural" grouping of survival curves 
derived directly from the data. Second, we investigated 
the effect of different settings in EACCD. Third, we 
compared EACCD with other clustering approaches. The 
results showed that if only the PAM is employed for 
learning dissimilarity, large values of m should be used 
with EACCD and that dendrograms generated from 
EACCD with the PAM and a large m primarily depend 
on the linkage functions and not on the statistical tests 
that are used in the learning step. The results also 
showed that EACCD can be applied to cancer patient 
data to obtain meaningful dendrograms. 

Figure 5 Dendrograms from various clustering approaches. (d) Approach A3. 

Table 2 Partition results for four clusters of SEER breast cancer data from 1990-2000. 

Group     EACCD                                   PAM 
Group 1   T1N0                                    T1N0 
Group 2   T1N1, T2N0, T3N0                        T1N1, T2N0, T3N0, T2N1 
Group 3   T1N2, T2N1                              T1N2, T2N2, T3N1 
Group 4   T1N3, T2N2, T2N3, T3N1, T3N2, T3N3      T1N3, T2N3, T3N2, T3N3 